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ABSTRACT 

The matter density is an important knowledge for today cos- 
mology as many phenomena are linked to matter fluctuations. 
However, this density is not directly available, but estimated 
through lensing maps or galaxy surveys. In this article, we 
focus on galaxy surveys which are incomplete and noisy ob- 
servations of the galaxy density. Incomplete, as part of the sky 
is unobserved or unreliable. Noisy as they are count maps de- 
graded by Poisson noise. Using a data augmentation method, 
we propose a two-step method for recovering the density map, 
one step for inferring missing data and one for estimating of 
the density. The results show that the missing areas are effi- 
ciently inferred and the statistical properties of the maps are 
very well preserved. 

Index Terms — Inpainting, Bayesian framework, Sparse 
representation, Poisson noise, Data augmentation 

1. INTRODUCTION 

Information about the origin of the Universe is encoded in- 
side the cosmological matter distribution. It is an important 
challenge to be able to estimate such a distribution. However, 
the whole matter is not available, only a biased observation 
is possible using count maps estimator of massive objects. 
These maps are degraded both by shot noise and astrophysics 
phenomena (e.g. Milky Way, galactic dust). 

Current methods for reconstructing the density maps are 
more focused on the denoising problem than on the missing 
data. For example, [1 1 propose a Wiener filter for estimating 
the SDSS DR6 survey [2|. A maximum a posteriori was in- 
troduced in [3 1 with a Poisson data fidelity term (including 
the mask operator) and a log-normal prior. 

This last prior comes from Hubble who found in 1934 that 
the distribution of galaxy counts is well fitted by a log-normal 
distribution. It was later confirmed by others studies that this 
statement is correct for a given interval of scales (mostly the 
medium scales) HE]. 

In this paper, we propose both to estimate the density map 
and infer the missing data using a data augmentation process 
151 . First, we present how to generate realistic data with a 
random texture synthesis algorithm. Secondly, a maximum 
a posteriori estimator is proposed for the density using both 



the log-normal prior and a sparse prior. Then, we apply our 
method on real data and compare the results with how state- 
of-art alternatives in the literature. 



Notation and terminology 

We denote by | . 1 1 2 the norm associated with the inner product 
in R™, and I is the identity operator on R™. A function / is 

coercive, if lim 1 1 5 1 1 2 ^ 00 / (<$) = +00. rn(R") is the class of 

all proper lower semi-continuous convex functions from R" 
to ] — 00, +00]. 

Let S € R™ be an n-pixel map. S can be written as the 
superposition of elementary atoms (p^ parametrized by 7 G I 
such that S = X) 7 Gi a 7 ( ' 9 7 = 1^1 = L, L ^ n. 

We denote by 3? the dictionary i.e. the n x L matrix whose 
columns are the generating waveforms (y> 7 ) x all normal- 



ized to a unit ^2-norm 
necessarily square matrix F = $ 



7 ex 

The forward transform is the non- 
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matrix <I> is said to be a frame with bounds ci and C2, 
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A dictionary 

< 



ci ^ c 2 < +00, if c\ 



< C2 ||<5|| . A frame 



is tight when c\ = c-i = c, i.e. <!><I> = ci. In the rest of the 
paper, will be an orthobasis or a tight frame with constant 
c. 

2. RECOVERING THE GALAXY DENSITY 

The galaxy surveys are count maps degraded by Poisson noise 
and parts of the sky are not subject to interferences with as- 
trophysical phenomena and objects (e.g. Milky Way, galactic 
dust). The underlying image formation model is, 



y~p(Mm(l + 5)) , 



(1) 



where y is the observation, 6 the density field, rh the mean 
number of counts (e.g. galaxies) per pixel and M the binary 
mask operator (i.e. where data are missing and 1 else). 

Estimating S from y is an ill-posed problem, so priors are 
needed to reduce the solution set. The density field is as- 
sumed to follow a log-normal distribution ||4], i.e. (1 + S) ~ 
CN{n, S) ~ exp(A/(/i, £)), where the mean fj, and the co- 
variance matrix S are the parameters of the underlying Gaus- 
sian field. 



2.1. A data augmentation method 

Inferring missing data is a longstanding and delicate problem. 
Most used methods in statistics to handle missing data are 
the expectation-maximization (EM) algorithm and multiple 
imputation (MI); see (6) for a comprehensive review. 

As the distribution of S is known, the data augmentation 
method [5| seems to be the most adapted. This is an EM 
scheme (described in Algorithm [Q where, first, the missing 
data are inferred using multiple imputations (i.e. several new 
observations are generated at each iteration). Then, the M- 
step consists in estimating the sought after parameter(s) based 
on the complete formed data. 



Algorithm 1: The data augmentation scheme. 
Task: Data augmentation method for both inferring the 
missing data and estimating the parameters. 
Parameters: The observation y, number of iterations Aj te r 
and the number of imputations A^mi • 
Main iteration: 

Initialization: Compute estimates of the log-normal 
parameters, E and fi, and of the mean number of counts m. 
For t = to iVitcr - 1, 

• E-step: create A^mi complete observations by filling the 
missing area using the prior distribution of the density field 
and the estimated parameters, 

• M-step: for each complete observation estimate the density 
field, the log-normal parameters, E^ and Hi of the field and 
the mean fhi, 

• Update step: each parameter is updated using the estimates 
from the multiple imputations, fi = /x^/ATmi. 

S = (E s SO/AW, m = (£. fhi)/Nui. 
End main iteration 



The multiple imputation is very useful when parameters 
are updated, and the number of imputations is linked to the 
amount of missing data (for example 5 imputations should 
be sufficient for 0.5 ratio [6]). The tricky point remains the 
generation of realistic data. With prior information, one can 
advocate Markov Chain Monte Carlo (MCMC) methods, but 
such methods are usually computationally very expensive. 



where each Ci represents a constraint set. In our case, we want 
to constraint the mean and the covariance of the underlying 
Gaussian field (as log-normality is assumed), but the observed 
parts of the density field must be preserved. The solution is 
then computed by projecting the data onto the constraints sets 
using the Von-Neumann alternating projections algorithm, 

St + i = V Cl oPc 2 oPc 3 (St) , (3) 

where C\ is the convex constraint set pertaining to the ob- 
served part preservation, C2 is associated to the covariance 
constraint (non-convex), and C3 is the mean constraint set 
(convex). Algorithm [2] summarizes the steps of this the syn- 
thesis method for our problem. As Ci is not convex, sophisti- 
cated arguments are needed to potentially prove convergence 
of the sequence (St)tefi to a point in nf =1 Ci (if non-empty). 
This will be left to a future work. In practice, only a few 
iterations were necessary to produce satisfactory results . 



Algorithm 2: Texture generating process for the impu- 
tation step (E-step). 
Task: Generating realistic data to infer the missing data. 
Parameters: The current estimate of the density field S, the 
observation y, the binary mask operator M, the covariance 
matrix E, the mean /i, the mean number of count m, and the 
number of iterations Atex- 
Initialization: po — S, 

Replace the data in the missing areas of p with random data 

using the log-normal distribution. 

Main iteration: 

For t = to At cx - 1, 

• Get the Gaussian field : z t = log(l + pt), 

• Estimate the mean : vn\ = E(zt), 

• Contraint the mean : Zt — Zt — ml + fi, 

• Estimate the covariance : Sf = Cov(zt), 

1 _i 

• Constraint the covariance : zt = E 2 S z 2 z t , 

• Update the estimate: p t = MS + (I - M)(exp(z t ) - 1). 

End main iteration 

Add Poisson noise : p ~ V(fh(l + pjv tax ))- 

Output: The imputed observation y — My + (I — M)p. 



2.2. Generating realistic data (E-step) 

As an alternative to MCMC methods, we propose a texture 
synthesis-like method for creating realistic data inside the 
missing areas obeying the appropriate statistical properties 
underlying the image formation model. Indeed, such data 
should respect the model formation (fl~|i (with M = I) and the 
density field has to follow a given log-normal distribution. 

Inspired by the work of [7 1 in texture synthesis, our statis- 
tical data generation can be cast as a hard feasibility problem, 

find S e n| =1 C, , (2) 



2.3. Estimating the galaxy density (M-step) 

Now, we assume complete observations, 

y ~ V (m(l + <5)) . (4) 

The density field estimation amounts now to a Poisson denois- 
ing problem. By adopting a Bayesian framework and using a 
standard maximum a posteriori (MAP) rule, we combine data 
fidelity with both log-normal prior and sparsity prior. 

The data fidelity term is directly constructed from the anti 



log-likelihood of the multivariate Poisson distribution, 

71 

-it : r, g E" h> J2 /poi S son(r?[i]), (5) 
-y[i] log(?7[i]) + rj[i] if r)[i] > 0, 



if y[i] > 0, /p i SSO n (»?[«]) 



ify[i]=0, /poisson (»?[«]) 



if 6 [0,+ao), 
+oo otherwise. 



where r] = 1 + S. 

The regularization term for the log-normal prior is given 
by anti-log likelihood of the multivariate log-normal distribu- 
tion for a covariance matrix £ and a mean /i, 

</ln(5|m> £) = (6) 

1 

-(log(l + 5)- M ) T E- 1 (log(l + S) - „) + lo g(! + S[i)) ■ 

i=l 

Notice that <?ln is not convex because the log function is 
concave. 

2.3.1. The optimization problem 

The non-convexity of qln can be avoided using a change of 
variable, z = log(l + S) and assuming that the underlying 
Gaussian field is sparse inside the dictionary domain. Then, 
the new optimization problem, with z = 3>a, is, 

(7) 

A* (a) , 



(Pa.-v.v) : min J( a ) 



J : a mexp (*a) + (7I — y) (*a) + 7 ||3?a — mIIJ-i 

where 1 is the vector of ones, £ and /1 the parameter of the 
log-normal prior and its weighting parameter 7, ^ : a H> 

i/'( a W) me sparsity -penalty and A the regularization pa- 
rameter. Notice that, we implicitly assume that the a[i], ^ 
i ^ n are independent and identically distributed. Then the 
solution is given by x = exp(z) — 1 = cxp($a) — 1. 

From (Pa,7,i/>) we can characterize the solution, 

Proposition 1. 

1. Existence: J G ro(R L ), then (P\,-y,ip) has at least one solu- 
tion. 

2. Uniqueness: (Pa, 7,^) has a unique solution if tp is strictly 
convex, i.e. if§> is an orthobasis or ifrp is strictly convex. 

2.3.2. Solving the optimization problem 

We first define the notion of a proximity operator, which was 
introduced as a generalization of the notion of a convex pro- 
jection operator. 

Definition 2 (| 8 1). Let ip g r (R"). Then, for every x G R n , the 
function y 1— ¥ <p(y) + \\x — y | 2 /2 achieves its infimum at a unique 
point denoted by prox^ x. The operator prox^ : R n — > R n thus 
defined is the proximity operator ofip. 

Then, the proximity operator of the indicator function of a 
convex set is merely its euclidean projector. With tp — |.|, the 
proximity operator prox A ^, is the popular soft-thresholding 
(denoted ST) with threshold A. 



If the dictionary $ is a tight frame, then the proximal op- 
erator of its composition with a convex function / is, 

Lemma 3 (|9|). If 3? is a tight frame,i.e. <1><I> T = vY, then f o # £ 
r (R n ) and 



proxj o4 , =1 + 1/ 1 <& T o (proxy —I) o <]? 



(8) 



We also need the proximity operator from the terms of 
both data fidelity and log-normal prior. 

Lemma 4. The proximity operator associated to F : x h-5> 
mexp(i) + (7I — y) T x + 7 \\x — is, 

prox^ x = KT 1 pTax 0ffiexp (K -1 (x + (y - 7I + 7/?E~ 1 ^t) )) , 
with pTox mexp x = log (W(/3m exp(x))/ (/9m)) , (9) 

where W is the Lambert W function [10] and K — I + 7/3E -1 . 

Then, we propose to use the generalization of the Douglas- 
Rachford algorithm presented in [9] in order to solve ((7}. The 
solution is computed using the iterative scheme presented by 
Algorithm^ 

Algorithm 3: Density field estimation, solve (Px,j,il>)- 
Task: Estimate the density field. 

Parameters: The observed image counts y, the mean number 
of count fa, the dictionary <&, the number of iterations /Vest, 
the proximal step /i, the log-normal prior parameter E and fi, 
and the regularizations parameters A and 7. 
Initialization: 

Vi G {0, 1}, P(0ii) = & T y. 

a = * T y. 

Main iteration: 

For t = to /Vest - 1, 

• Data fidelity with log-normal prior (Lemma[5]and|4l: 

f(t,o) = P(t,o) + c _1 # T o (prox MF/2 -I) o *(p (f)0) ). 

• Sparsity-penalty: 

f(t,l) = P rox M A*/2P(t,D = ST MA/2(P(i,l))- 

• Average the proximity operators: £t = (£(t,o) + £(t.i))/2. 

• Choose t G]0, 2[. 

• Update the components: 

Vi G {0, 1}, p ( t+i,i) = P(t,i) + 0t(2& - a t - £ (M) ). 

• Update the coefficients estimate: at+i — at + #t(£t — at) 

End main iteration 

Output: Denoised field <5* = exp(€>ajv cat ) — 1. 



3. RESULTS ON THE 2MASS SURVEY 

As an experiment, we apply our method on the 2MASS 
galaxy survey ifTTl . As we are working on the sphere, the 
dictionary <1> contains the spherical harmonics transform, 
which is an orthobasis. The manually tuned parameters were 

TVtex = 15, A^cst = 40 ,/Vner = 6, 7V M i = 10, 7 = 10- 4 
and A = 10 -3 , that seems sufficient for recovering most of 



the large scales. This method was compared with the inpaint- 
ing method proposed in lfl"2l (denoted M2) which fills in the 
missing data area using both sparsity and a quadratic data 
fidelity. 

The results are pictured by Fig. Q] In order to compare 
efficiently the results, we remove all the spherical harmonic 
modes beyond 200 (i.e. t < 200) of the maps. Method Ml 
gives a better estimation of the inpainted areas, as realistic 
structures has been created inside these areas and the transi- 
tion between missing and observed pixels are invisible. While 
with method M2, transitions can be clearly seen and no struc- 
ture is infer inside the large missing area in the center. 

For the denoised zones, both methods preserve the struc- 
tures and the amplitude. In order to compare the behavior 
of the two methods, we also compare in Fig. [2] the second- 
order statistics of the inpainted maps to the theory [ 131- More 
precisely, we focus on the first modes of the harmonic power 
spectrum of the density field. While at the beginning meth- 
ods Ml and M2 provide similar results, they differ on higher 
modes where the Poisson noise becomes more salient at the 
profit of the Ml method. 

4. CONCLUSION 

An inpainting method is proposed using a data augmentation 
procedure, where the observation is first completed with real- 
istic data. For galaxy density field, we propose to use two pri- 
ors, first we assume that the density field follows a log-normal 
distribution and secondly, the underlying Gaussian field is as- 
sumed to be sparse inside a wisely chosen dictionary. The 
resulting algorithm is able to preserve the second order statis- 
tical properties which is an important feature for astrophysics 
application. 
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Fig. 2. The theoretical (green), Ml (black) and M2 (blue) 
power spectra for the first 100 harmonics modes. 



2MA5S inpainting with MS 




Fig. 1. Results on the inpainting methods on the 2MASS 
count map. Top: The 2MASS noisy density map with the 
missing data in gray. Middle and bottom: The inpainted den- 
sity map using the method Ml (Middle) and M2 (bottom). 



2MASS count map 



